{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CNTK 202: Language Understanding with Recurrent Networks\n",
    "\n",
    "This tutorial shows how to implement a recurrent network to process text,\n",
    "for the [Air Travel Information Services](https://catalog.ldc.upenn.edu/LDC95S26) \n",
    "(ATIS) task of slot tagging (tag individual words to their respective classes, \n",
    "where the classes are provided as labels in the training data set).\n",
    "\n",
    "There are 2 parts to this tutorial:\n",
    "- Part 1: We will tag each word in a sequence to their corresponding label\n",
    "- Part 2: We will classify a sequence to its corresponding intent.\n",
    "\n",
    "We will start with a straight-forward (linear) embedding of the words followed by a recurrent LSTM to label each word in a sequence to the corresponding class. We will show how to classify each word token in a sequence to the corresponding class. This will then be extended to include neighboring words and run bidirectionally.\n",
    "\n",
    "We will take the last state of the sequence and train a model that classifies the entire sequence to the corresponding class label (in this case the intent associated with the sequence).\n",
    "\n",
    "The techniques you will practice are:\n",
    "* model description by composing layer blocks, a convenient way to compose \n",
    "  networks/models without requiring the need to write formulas,\n",
    "* creating your own layer block\n",
    "* variables with different sequence lengths in the same network\n",
    "* training the network\n",
    "\n",
    "We assume that you are familiar with basics of deep learning, and these specific concepts:\n",
    "* recurrent networks ([Wikipedia page](https://en.wikipedia.org/wiki/Recurrent_neural_network))\n",
    "* text embedding ([Wikipedia page](https://en.wikipedia.org/wiki/Word_embedding))\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "We assume that you have already [installed CNTK](https://docs.microsoft.com/en-us/cognitive-toolkit/Setup-CNTK-on-your-machine).\n",
    "This tutorial requires CNTK V2. We strongly recommend to run this tutorial on a machine with\n",
    "a capable CUDA-compatible GPU. Deep learning without GPUs is not fun."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data download\n",
    "\n",
    "In this tutorial, we are going to use a (lightly preprocessed) version of the ATIS dataset. You can download the data automatically by running the cells below or by executing the manual instructions.\n",
    "\n",
    "**Fallback manual instructions**\n",
    "Download the ATIS [training](https://github.com/Microsoft/CNTK/blob/release/2.7/Tutorials/SLUHandsOn/atis.train.ctf) \n",
    "and [test](https://github.com/Microsoft/CNTK/blob/release/2.7/Tutorials/SLUHandsOn/atis.test.ctf) \n",
    "files and put them at the same folder as this notebook. If you want to see how the model is \n",
    "predicting on new sentences you will also need the vocabulary files for \n",
    "[queries](https://github.com/Microsoft/CNTK/blob/release/2.7/Examples/LanguageUnderstanding/ATIS/BrainScript/query.wl) and\n",
    "[slots](https://github.com/Microsoft/CNTK/blob/release/2.7/Examples/LanguageUnderstanding/ATIS/BrainScript/slots.wl)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Reusing locally cached: query.wl\n",
      "Reusing locally cached: intent.wl\n",
      "Reusing locally cached: atis.test.ctf\n",
      "Reusing locally cached: atis.train.ctf\n",
      "Reusing locally cached: slots.wl\n"
     ]
    }
   ],
   "source": [
    "from __future__ import print_function # Use a function definition from future version (say 3.x from 2.7 interpreter)\n",
    "import requests\n",
    "import os\n",
    "\n",
    "def download(url, filename):\n",
    "    \"\"\" utility function to download a file \"\"\"\n",
    "    response = requests.get(url, stream=True)\n",
    "    with open(filename, \"wb\") as handle:\n",
    "        for data in response.iter_content():\n",
    "            handle.write(data)\n",
    "\n",
    "locations = ['Tutorials/SLUHandsOn', 'Examples/LanguageUnderstanding/ATIS/BrainScript']\n",
    "\n",
    "data = {\n",
    "  'train': { 'file': 'atis.train.ctf', 'location': 0 },\n",
    "  'test': { 'file': 'atis.test.ctf', 'location': 0 },\n",
    "  'query': { 'file': 'query.wl', 'location': 1 },\n",
    "  'slots': { 'file': 'slots.wl', 'location': 1 },\n",
    "  'intent': { 'file': 'intent.wl', 'location': 1 }  \n",
    "}\n",
    "\n",
    "for item in data.values():\n",
    "    location = locations[item['location']]\n",
    "    path = os.path.join('..', location, item['file'])\n",
    "    if os.path.exists(path):\n",
    "        print(\"Reusing locally cached:\", item['file'])\n",
    "        # Update path\n",
    "        item['file'] = path\n",
    "    elif os.path.exists(item['file']):\n",
    "        print(\"Reusing locally cached:\", item['file'])\n",
    "    else:\n",
    "        print(\"Starting download:\", item['file'])\n",
    "        url = \"https://github.com/Microsoft/CNTK/blob/release/2.7/%s/%s?raw=true\"%(location, item['file'])\n",
    "        download(url, item['file'])\n",
    "        print(\"Download completed\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Importing libraries**: CNTK, math and numpy \n",
    "\n",
    "CNTK's Python module contains several submodules like `io`, `learner`, and `layers`. We also use NumPy in some cases since the results returned by CNTK work like NumPy arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import math\n",
    "import numpy as np\n",
    "\n",
    "import cntk as C\n",
    "import cntk.tests.test_utils\n",
    "cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for our build system)\n",
    "C.cntk_py.set_fixed_random_seed(1) # fix a random seed for CNTK components"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task overview: Slot tagging\n",
    "\n",
    "The task we want to approach in this tutorial is slot tagging.\n",
    "We use the [ATIS corpus](https://catalog.ldc.upenn.edu/LDC95S26).\n",
    "ATIS contains human-computer queries from the domain of Air Travel Information Services,\n",
    "and our task will be to annotate (tag) each word of a query whether it belongs to a\n",
    "specific item of information (slot), and which one.\n",
    "\n",
    "The data in your working folder has already been converted into the \"CNTK Text Format.\"\n",
    "Let us look at an example from the test-set file `atis.test.ctf`:\n",
    "\n",
    "    19  |S0 178:1 |# BOS      |S1 14:1 |# flight  |S2 128:1 |# O\n",
    "    19  |S0 770:1 |# show                         |S2 128:1 |# O\n",
    "    19  |S0 429:1 |# flights                      |S2 128:1 |# O\n",
    "    19  |S0 444:1 |# from                         |S2 128:1 |# O\n",
    "    19  |S0 272:1 |# burbank                      |S2 48:1  |# B-fromloc.city_name\n",
    "    19  |S0 851:1 |# to                           |S2 128:1 |# O\n",
    "    19  |S0 789:1 |# st.                          |S2 78:1  |# B-toloc.city_name\n",
    "    19  |S0 564:1 |# louis                        |S2 125:1 |# I-toloc.city_name\n",
    "    19  |S0 654:1 |# on                           |S2 128:1 |# O\n",
    "    19  |S0 601:1 |# monday                       |S2 26:1  |# B-depart_date.day_name\n",
    "    19  |S0 179:1 |# EOS                          |S2 128:1 |# O\n",
    "\n",
    "This file has 7 columns:\n",
    "\n",
    "* a sequence id (19). There are 11 entries with this sequence id. This means that sequence 19 consists\n",
    "of 11 tokens;\n",
    "* column `S0`, which contains numeric word indices; the input data is encoded in one-hot vectors.  There are 943 words in the vocabulary, so each word is a 943 element vector of all 0 with a 1 at a vector index chosen to represent that word.  For example the word \"from\" is represented with a 1 at index 444 and zero everywhere else in the vector. The word \"monday\" is represented with a 1 at index 601 and zero everywhere else in the vector.\n",
    "* a comment column denoted by `#`, to allow a human reader to know what the numeric word index stands for;\n",
    "Comment columns are ignored by the system. `BOS` and `EOS` are special words\n",
    "to denote beginning and end of sentence, respectively;\n",
    "* column `S1` is an intent label, which we will use in the second part of the tutorial;\n",
    "* another comment column that shows the human-readable label of the numeric intent index;\n",
    "* column `S2` is the slot label, represented as a numeric index; and\n",
    "* another comment column that shows the human-readable label of the numeric label index.\n",
    "\n",
    "The task of the neural network is to look at the query (column `S0`) and predict the\n",
    "slot label (column `S2`).\n",
    "As you can see, each word in the input gets assigned either an empty label `O`\n",
    "or a slot label that begins with `B-` for the first word, and with `I-` for any\n",
    "additional consecutive word that belongs to the same slot.\n",
    "\n",
    "### Model Creation\n",
    "\n",
    "The model we will use is a recurrent model consisting of an embedding layer,\n",
    "a recurrent LSTM cell, and a dense layer to compute the posterior probabilities:\n",
    "\n",
    "\n",
    "    slot label   \"O\"        \"O\"        \"O\"        \"O\"  \"B-fromloc.city_name\"\n",
    "                  ^          ^          ^          ^          ^\n",
    "                  |          |          |          |          |\n",
    "              +-------+  +-------+  +-------+  +-------+  +-------+\n",
    "              | Dense |  | Dense |  | Dense |  | Dense |  | Dense |  ...\n",
    "              +-------+  +-------+  +-------+  +-------+  +-------+\n",
    "                  ^          ^          ^          ^          ^\n",
    "                  |          |          |          |          |\n",
    "              +------+   +------+   +------+   +------+   +------+   \n",
    "         0 -->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->...\n",
    "              +------+   +------+   +------+   +------+   +------+   \n",
    "                  ^          ^          ^          ^          ^\n",
    "                  |          |          |          |          |\n",
    "              +-------+  +-------+  +-------+  +-------+  +-------+\n",
    "              | Embed |  | Embed |  | Embed |  | Embed |  | Embed |  ...\n",
    "              +-------+  +-------+  +-------+  +-------+  +-------+\n",
    "                  ^          ^          ^          ^          ^\n",
    "                  |          |          |          |          |\n",
    "    w      ------>+--------->+--------->+--------->+--------->+------... \n",
    "                 BOS      \"show\"    \"flights\"    \"from\"   \"burbank\"\n",
    "\n",
    "Or, as a CNTK network description. Please have a quick look and match it with the description above:\n",
    "(descriptions of these functions can be found at: [the layers reference](http://cntk.ai/pythondocs/layerref.html))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# number of words in vocab, slot labels, and intent labels\n",
    "vocab_size = 943 ; num_labels = 129 ; num_intents = 26    \n",
    "\n",
    "# model dimensions\n",
    "input_dim  = vocab_size\n",
    "label_dim  = num_labels\n",
    "emb_dim    = 150\n",
    "hidden_dim = 300\n",
    "\n",
    "# Create the containers for input feature (x) and the label (y)\n",
    "x = C.sequence.input_variable(vocab_size)\n",
    "y = C.sequence.input_variable(num_labels)\n",
    "\n",
    "def create_model():\n",
    "    with C.layers.default_options(initial_state=0.1):\n",
    "        return C.layers.Sequential([\n",
    "            C.layers.Embedding(emb_dim, name='embed'),\n",
    "            C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=False),\n",
    "            C.layers.Dense(num_labels, name='classify')\n",
    "        ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are ready to create a model and inspect it. \n",
    "\n",
    "The model attributes are fully accessible from Python. The first layer named `embed` is an Embedding layer. Here we use the CNTK default, which is linear embedding. It is a simple matrix with dimension (input word encoding x output projected dimension). You can access its parameter `E` (where the embeddings are stored) like any other attribute of a Python object. Its shape contains a `-1` which indicates that this parameter (with input dimension) is not fully specified yet, while the output dimension is set to `emb_dim` ( = 150 in this tutorial). \n",
    "\n",
    "Additionally, we also inspect the value of the bias vector in the `Dense` layer named `classify`. The `Dense` layer is a fundamental compositional unit of a Multi-Layer Perceptron (as introduced in CNTK 103C tutorial). The `Dense` layer has both `weight` and `bias` parameters, one each per `Dense` layer. Bias terms are by default initialized to 0 (but there is a way to change that if you need). As you create the model, one should name the layer component and then access the parameters as shown here. \n",
    "\n",
    "**Suggested task**: What should be the expected dimension of the `weight` matrix from the layer named `classify`? Try printing the weight matrix of the `classify` layer? Does it match with your expected size?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(-1, 150)\n",
      "[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.\n",
      "  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.\n",
      "  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.\n",
      "  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.\n",
      "  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.\n",
      "  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.\n",
      "  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.\n",
      "  0.  0.  0.]\n"
     ]
    }
   ],
   "source": [
    "# peek\n",
    "z = create_model()\n",
    "print(z.embed.E.shape)\n",
    "print(z.classify.b.value)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In our case we have input as one-hot encoded vector of length 943 and the output dimension `emb_dim` is set to 150. In the code below we pass the input variable `x` to our model `z`. This binds the model with input data of known shape. In this case, the input shape will be the size of the input vocabulary. With this modification, the parameter returned by the embed layer is completely specified (943, 150). **Note**: You can initialize the Embedding matrix with pre-computed vectors using [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) or [GloVe](https://en.wikipedia.org/wiki/GloVe_%28machine_learning%29)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(943, 150)\n"
     ]
    }
   ],
   "source": [
    "# Pass an input and check the dimension\n",
    "z = create_model()\n",
    "print(z(x).embed.E.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To train and test a model in CNTK, we need to create a model and specify how to read data and perform training and testing. \n",
    "\n",
    "In order to train we need to specify:\n",
    "\n",
    "* how to read the data \n",
    "* the model function, its inputs, and outputs\n",
    "* hyper-parameters for the learner such as the learning rate\n",
    "\n",
    "[comment]: <> (For testing ...)\n",
    "\n",
    "## Data Reading\n",
    "\n",
    "We already looked at the data.\n",
    "But how do you generate this format?\n",
    "For reading text, this tutorial uses the `CNTKTextFormatReader`. It expects the input data to be\n",
    "in a specific format, as described [here](https://docs.microsoft.com/en-us/cognitive-toolkit/Brainscript-CNTKTextFormat-Reader).\n",
    "\n",
    "For this tutorial, we created the corpora by two steps:\n",
    "* convert the raw data into a plain text file that contains of TAB-separated columns of space-separated text. For example:\n",
    "\n",
    "  ```\n",
    "  BOS show flights from burbank to st. louis on monday EOS (TAB) flight (TAB) O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name O B-depart_date.day_name O\n",
    "  ```\n",
    "\n",
    "  This is meant to be compatible with the output of the `paste` command.\n",
    "* convert it to CNTK Text Format (CTF) with the following command:\n",
    "\n",
    "  ```\n",
    "  python [CNTK root]/Scripts/txt2ctf.py --map query.wl intent.wl slots.wl --annotated True --input atis.test.txt --output atis.test.ctf\n",
    "  ```\n",
    "  where the three `.wl` files give the vocabulary as plain text files, one word per line.\n",
    "\n",
    "In these CTF files, our columns are labeled `S0`, `S1`, and `S2`.\n",
    "These are connected to the actual network inputs by the corresponding lines in the reader definition:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def create_reader(path, is_training):\n",
    "    return C.io.MinibatchSource(C.io.CTFDeserializer(path, C.io.StreamDefs(\n",
    "         query         = C.io.StreamDef(field='S0', shape=vocab_size,  is_sparse=True),\n",
    "         intent        = C.io.StreamDef(field='S1', shape=num_intents, is_sparse=True),  \n",
    "         slot_labels   = C.io.StreamDef(field='S2', shape=num_labels,  is_sparse=True)\n",
    "     )), randomize=is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['query', 'slot_labels', 'intent'])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# peek\n",
    "reader = create_reader(data['train']['file'], is_training=True)\n",
    "reader.streams.keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training\n",
    "\n",
    "We also must define the training criterion (loss function), and also an error metric to track. In most tutorials, we know the input dimensions and the corresponding labels. We directly create the loss and the error functions. In this tutorial we will do the same. However, we take a brief detour and learn about placeholders. This concept would be useful for Task 3. \n",
    "\n",
    "**Learning note**: Introduction to `placeholder`: Remember that the code we have been writing is not actually executing any heavy computation it is just specifying the function we want to compute on data during training/testing. And in the same way that it is convenient to have names for arguments when you write a regular function in a programming language, it is convenient to have placeholders that refer to arguments (or local computations that need to be reused). Eventually, some other code will replace these placeholders with other known quantities in the same way that in a programming language the function will be called with concrete values bound to its arguments. \n",
    "\n",
    "Specifically, the input variables you have created above `x = C.sequence.input_variable(vocab_size)` holds data pre-defined by `vocab_size`. In the case where such instantiations are challenging or not possible, using `placeholder` is a logical choice. Having the `placeholder` only allows you to defer the specification of the argument at a later time when you may have the data.\n",
    "\n",
    "Here is an example below that illustrates the use of `placeholder`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Composite(Combine): Input('Input2300', [#, *], [129]), Placeholder('labels', [???], [???]) -> Output('Block2270_Output_0', [#, *], [1]), Output('Block2290_Output_0', [#, *], [])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def create_criterion_function(model):\n",
    "    labels = C.placeholder(name='labels')\n",
    "    ce   = C.cross_entropy_with_softmax(model, labels)\n",
    "    errs = C.classification_error      (model, labels)\n",
    "    return C.combine ([ce, errs]) # (features, labels) -> (loss, metric)\n",
    "\n",
    "criterion = create_criterion_function(create_model())\n",
    "criterion.replace_placeholders({criterion.placeholders[0]: C.sequence.input_variable(num_labels)})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While the cell above works well when one has input parameters defined at network creation, it compromises readability. Hence we prefer creating functions as shown below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def create_criterion_function_preferred(model, labels):\n",
    "    ce   = C.cross_entropy_with_softmax(model, labels)\n",
    "    errs = C.classification_error      (model, labels)\n",
    "    return ce, errs # (model, labels) -> (loss, error metric)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def train(reader, model_func, max_epochs=10, task='slot_tagging'):\n",
    "    \n",
    "    # Instantiate the model function; x is the input (feature) variable \n",
    "    model = model_func(x)\n",
    "    \n",
    "    # Instantiate the loss and error function\n",
    "    loss, label_error = create_criterion_function_preferred(model, y)\n",
    "\n",
    "    # training config\n",
    "    epoch_size = 18000        # 18000 samples is half the dataset size \n",
    "    minibatch_size = 70\n",
    "    \n",
    "    # LR schedule over epochs \n",
    "    # In CNTK, an epoch is how often we get out of the minibatch loop to\n",
    "    # do other stuff (e.g. checkpointing, adjust learning rate, etc.)\n",
    "    lr_per_sample = [3e-4]*4+[1.5e-4]\n",
    "    lr_per_minibatch = [lr * minibatch_size for lr in lr_per_sample]\n",
    "    lr_schedule = C.learning_parameter_schedule(lr_per_minibatch, epoch_size=epoch_size)\n",
    "    \n",
    "    # Momentum schedule\n",
    "    momentums = C.momentum_schedule(0.9048374180359595, minibatch_size=minibatch_size)\n",
    "    \n",
    "    # We use a the Adam optimizer which is known to work well on this dataset\n",
    "    # Feel free to try other optimizers from \n",
    "    # https://www.cntk.ai/pythondocs/cntk.learner.html#module-cntk.learner\n",
    "    learner = C.adam(parameters=model.parameters,\n",
    "                     lr=lr_schedule,\n",
    "                     momentum=momentums,\n",
    "                     gradient_clipping_threshold_per_sample=15, \n",
    "                     gradient_clipping_with_truncation=True)\n",
    "\n",
    "    # Setup the progress updater\n",
    "    progress_printer = C.logging.ProgressPrinter(tag='Training', num_epochs=max_epochs)\n",
    "    \n",
    "    # Uncomment below for more detailed logging\n",
    "    #progress_printer = ProgressPrinter(freq=100, first=10, tag='Training', num_epochs=max_epochs) \n",
    "\n",
    "    # Instantiate the trainer\n",
    "    trainer = C.Trainer(model, (loss, label_error), learner, progress_printer)\n",
    "\n",
    "    # process minibatches and perform model training\n",
    "    C.logging.log_number_of_parameters(model)\n",
    "    \n",
    "    # Assign the data fields to be read from the input\n",
    "    if task == 'slot_tagging':\n",
    "        data_map={x: reader.streams.query, y: reader.streams.slot_labels}\n",
    "    else:\n",
    "        data_map={x: reader.streams.query, y: reader.streams.intent} \n",
    "\n",
    "    t = 0\n",
    "    for epoch in range(max_epochs):         # loop over epochs\n",
    "        epoch_end = (epoch+1) * epoch_size\n",
    "        while t < epoch_end:                # loop over minibatches on the epoch\n",
    "            data = reader.next_minibatch(minibatch_size, input_map= data_map)  # fetch minibatch\n",
    "            trainer.train_minibatch(data)               # update model with it\n",
    "            t += data[y].num_samples                    # samples so far\n",
    "        trainer.summarize_training_progress()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Run the trainer**\n",
    "\n",
    "You can find the complete recipe below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training 721479 parameters in 6 parameter tensors.\n",
      "Learning rate per minibatch: 0.020999999999999998\n",
      "Finished Epoch[1 of 10]: [Training] loss = 1.740198 * 18010, metric = 28.02% * 18010 6.466s (2785.3 samples/s);\n",
      "Finished Epoch[2 of 10]: [Training] loss = 0.665177 * 18051, metric = 14.30% * 18051 5.238s (3446.2 samples/s);\n",
      "Finished Epoch[3 of 10]: [Training] loss = 0.526256 * 17941, metric = 11.34% * 17941 5.198s (3451.5 samples/s);\n",
      "Finished Epoch[4 of 10]: [Training] loss = 0.395405 * 18059, metric = 8.22% * 18059 5.329s (3388.8 samples/s);\n",
      "Learning rate per minibatch: 0.010499999999999999\n",
      "Finished Epoch[5 of 10]: [Training] loss = 0.293512 * 17957, metric = 6.20% * 17957 5.106s (3516.8 samples/s);\n",
      "Finished Epoch[6 of 10]: [Training] loss = 0.264932 * 18021, metric = 5.73% * 18021 5.335s (3377.9 samples/s);\n",
      "Finished Epoch[7 of 10]: [Training] loss = 0.217258 * 17980, metric = 4.69% * 17980 5.248s (3426.1 samples/s);\n",
      "Finished Epoch[8 of 10]: [Training] loss = 0.209614 * 18025, metric = 4.55% * 18025 5.139s (3507.5 samples/s);\n",
      "Finished Epoch[9 of 10]: [Training] loss = 0.165851 * 17956, metric = 3.84% * 17956 5.636s (3185.9 samples/s);\n",
      "Finished Epoch[10 of 10]: [Training] loss = 0.157653 * 18039, metric = 3.41% * 18039 5.646s (3195.0 samples/s);\n"
     ]
    }
   ],
   "source": [
    "def do_train():\n",
    "    global z\n",
    "    z = create_model()\n",
    "    reader = create_reader(data['train']['file'], is_training=True)\n",
    "    train(reader, z)\n",
    "do_train()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This shows how learning proceeds over epochs (passes through the data).\n",
    "For example, after four epochs, the loss, which is the cross-entropy criterion, \n",
    "has reduced significantly as measured on the ~18000 samples of this epoch,\n",
    "and the same with the error rate on those same 18000 training samples.\n",
    "\n",
    "The epoch size is the number of samples--counted as *word tokens*, not sentences--to\n",
    "process between model checkpoints.\n",
    "\n",
    "Once the training has completed (a little less than 2 minutes on a Titan-X or a Surface Book),\n",
    "you will see an output like this\n",
    "```\n",
    "Finished Epoch[10 of 10]: [Training] loss = 0.157653 * 18039, metric = 3.41% * 18039\n",
    "```\n",
    "which is the loss (cross entropy) and the metric (classification error) averaged over the final epoch.\n",
    "\n",
    "On a CPU-only machine, it can be 4 or more times slower. You can try setting\n",
    "```python\n",
    "emb_dim    = 50 \n",
    "hidden_dim = 100\n",
    "```\n",
    "to reduce the time it takes to run on a CPU, but the model will not fit as well as when the \n",
    "hidden and embedding dimension are larger. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluating the model\n",
    "\n",
    "Like the train() function, we also define a function to measure accuracy on a test set by computing the error over multiple minibatches of test data. For evaluating on a small sample read from a file, you can set a minibatch size reflecting the sample size and run the test_minibatch on that instance of data. To see how to evaluate a single sequence, we provide an instance later in the tutorial. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def evaluate(reader, model_func, task='slot_tagging'):\n",
    "    \n",
    "    # Instantiate the model function; x is the input (feature) variable \n",
    "    model = model_func(x)\n",
    "    \n",
    "    # Create the loss and error functions\n",
    "    loss, label_error = create_criterion_function_preferred(model, y)\n",
    "\n",
    "    # process minibatches and perform evaluation\n",
    "    progress_printer = C.logging.ProgressPrinter(tag='Evaluation', num_epochs=0)\n",
    "    \n",
    "    # Assign the data fields to be read from the input\n",
    "    if task == 'slot_tagging':\n",
    "        data_map={x: reader.streams.query, y: reader.streams.slot_labels}\n",
    "    else:\n",
    "        data_map={x: reader.streams.query, y: reader.streams.intent} \n",
    "\n",
    "    while True:\n",
    "        minibatch_size = 500\n",
    "        data = reader.next_minibatch(minibatch_size, input_map= data_map)  # fetch minibatch\n",
    "        if not data:                                 # until we hit the end\n",
    "            break\n",
    "\n",
    "        evaluator = C.eval.Evaluator(loss, progress_printer)\n",
    "        evaluator.test_minibatch(data)\n",
    "     \n",
    "    evaluator.summarize_test_progress()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can measure the model accuracy by going through all the examples in the test set and using the ``C.eval.Evaluator`` method. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Evaluation [1]: Minibatch[1-23]: metric = 0.34% * 10984;\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([ -1.86572317e-02,  -8.51036515e-03,   1.38878925e-02,\n",
       "        -1.95176266e-02,  -2.78977025e-03,  -1.23168388e-02,\n",
       "        -6.16775267e-03,  -1.48158008e-02,  -5.82036236e-03,\n",
       "        -2.99133137e-02,  -1.39552690e-02,  -2.11144108e-02,\n",
       "        -1.13499342e-02,  -1.25011550e-02,  -8.19404377e-04,\n",
       "        -5.26463473e-03,  -2.67275460e-02,  -1.80706571e-04,\n",
       "        -3.97865893e-03,  -2.99989916e-02,  -1.00385472e-02,\n",
       "        -6.81575621e-03,  -2.65348833e-02,  -2.01367699e-02,\n",
       "        -2.63106022e-02,   4.22888156e-03,   5.74267423e-03,\n",
       "        -1.56373512e-02,   7.71288527e-04,  -2.11508083e-03,\n",
       "        -9.64712165e-03,  -1.98591035e-02,  -1.32136559e-02,\n",
       "         7.97899254e-03,  -1.76088810e-02,   9.19441786e-03,\n",
       "         1.30802142e-02,  -3.85359419e-03,   1.86733739e-03,\n",
       "        -5.96518070e-03,  -3.07163727e-02,  -3.04672867e-03,\n",
       "        -3.46868881e-04,  -1.29565294e-03,  -4.47260169e-03,\n",
       "        -1.29292896e-02,  -1.05356863e-02,  -9.16024856e-03,\n",
       "         6.08767197e-03,  -3.75504000e-03,  -2.08706614e-02,\n",
       "        -6.74075307e-03,  -1.62283499e-02,  -1.54837407e-02,\n",
       "        -4.45737224e-03,  -2.18946021e-02,  -7.09120464e-03,\n",
       "        -2.59322841e-02,  -7.19473930e-03,  -2.38050371e-02,\n",
       "        -2.12035086e-02,  -1.92295481e-02,  -1.78258196e-02,\n",
       "        -2.89904419e-02,  -2.11317427e-02,  -1.59252994e-02,\n",
       "         1.15247713e-02,  -6.23733690e-03,  -2.34362725e-02,\n",
       "        -2.94410121e-02,  -2.90733539e-02,  -2.31353957e-02,\n",
       "        -2.56022997e-02,  -2.99183521e-02,  -3.48845944e-02,\n",
       "        -2.37278938e-02,  -1.23830158e-02,  -8.28807056e-03,\n",
       "        -7.39323627e-03,  -2.71228235e-02,  -1.66217834e-02,\n",
       "        -2.01343931e-02,  -7.25648087e-03,  -1.39272353e-02,\n",
       "        -6.12456305e-03,  -1.73326526e-02,  -2.00424399e-02,\n",
       "        -6.42115856e-03,  -1.77380182e-02,   2.44801558e-05,\n",
       "        -2.94576529e-02,   3.32167302e-03,  -2.08815038e-02,\n",
       "        -1.13182077e-02,  -1.59333460e-02,  -1.49212936e-02,\n",
       "         5.97879477e-03,  -1.84684750e-02,  -2.37341877e-02,\n",
       "        -3.12264990e-02,   5.03906514e-03,  -3.30699719e-02,\n",
       "        -2.31159870e-02,  -9.83368699e-03,  -2.43863855e-02,\n",
       "        -1.25425290e-02,  -2.47525666e-02,  -9.63981543e-03,\n",
       "        -1.55018885e-02,  -9.93501674e-03,  -1.19379470e-02,\n",
       "        -5.87523589e-03,  -1.70155372e-02,  -2.29082517e-02,\n",
       "        -1.84413474e-02,  -1.43948747e-02,  -1.95573717e-02,\n",
       "        -1.57539565e-02,  -1.90414693e-02,  -9.15751979e-03,\n",
       "        -2.89104711e-02,  -1.02876564e-02,  -2.83453409e-02,\n",
       "        -1.30684981e-02,  -5.12228906e-03,  -1.68853626e-02,\n",
       "        -1.10401753e-02,  -7.05094775e-03,   1.51731307e-02], dtype=float32)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def do_test():\n",
    "    reader = create_reader(data['test']['file'], is_training=False)\n",
    "    evaluate(reader, z)\n",
    "do_test()\n",
    "z.classify.b.value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following block of code illustrates how to evaluate a single sequence. Additionally we show how one can pass in the information using NumPy arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[178, 429, 444, 619, 937, 851, 752, 179]\n",
      "(8, 129)\n",
      "[128 128 128  48 110 128  78 128]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('BOS', 'O'),\n",
       " ('flights', 'O'),\n",
       " ('from', 'O'),\n",
       " ('new', 'B-fromloc.city_name'),\n",
       " ('york', 'I-fromloc.city_name'),\n",
       " ('to', 'O'),\n",
       " ('seattle', 'B-toloc.city_name'),\n",
       " ('EOS', 'O')]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load dictionaries\n",
    "query_wl = [line.rstrip('\\n') for line in open(data['query']['file'])]\n",
    "slots_wl = [line.rstrip('\\n') for line in open(data['slots']['file'])]\n",
    "query_dict = {query_wl[i]:i for i in range(len(query_wl))}\n",
    "slots_dict = {slots_wl[i]:i for i in range(len(slots_wl))}\n",
    "\n",
    "# let's run a sequence through\n",
    "seq = 'BOS flights from new york to seattle EOS'\n",
    "w = [query_dict[w] for w in seq.split()] # convert to word indices\n",
    "print(w)\n",
    "onehot = np.zeros([len(w),len(query_dict)], np.float32)\n",
    "for t in range(len(w)):\n",
    "    onehot[t,w[t]] = 1\n",
    "\n",
    "#x = C.sequence.input_variable(vocab_size)\n",
    "pred = z(x).eval({x:[onehot]})[0]\n",
    "print(pred.shape)\n",
    "best = np.argmax(pred,axis=1)\n",
    "print(best)\n",
    "list(zip(seq.split(),[slots_wl[s] for s in best]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Modifying the Model\n",
    "\n",
    "In the following, you will be given tasks to practice modifying CNTK configurations.\n",
    "The solutions are given at the end of this document... but please try without!\n",
    "\n",
    "**A Word About [`Sequential()`](https://www.cntk.ai/pythondocs/layerref.html#sequential)**\n",
    "\n",
    "Before jumping to the tasks, let's have a look again at the model we just ran.\n",
    "The model is described in what we call *function-composition style*.\n",
    "```python\n",
    "        Sequential([\n",
    "            Embedding(emb_dim),\n",
    "            Recurrence(LSTM(hidden_dim), go_backwards=False),\n",
    "            Dense(num_labels)\n",
    "        ])\n",
    "```\n",
    "You may be familiar with the \"sequential\" notation from other neural-network toolkits.\n",
    "If not, [`Sequential()`](https://www.cntk.ai/pythondocs/layerref.html#sequential) is a powerful operation that,\n",
    "in a nutshell, allows to compactly express a very common situation in neural networks\n",
    "where an input is processed by propagating it through a progression of layers.\n",
    "`Sequential()` takes an list of functions as its argument,\n",
    "and returns a *new* function that invokes these functions in order,\n",
    "each time passing the output of one to the next.\n",
    "For example,\n",
    "```python\n",
    "\tFGH = Sequential ([F,G,H])\n",
    "    y = FGH (x)\n",
    "```\n",
    "means the same as\n",
    "```\n",
    "    y = H(G(F(x))) \n",
    "```\n",
    "This is known as [\"function composition\"](https://en.wikipedia.org/wiki/Function_composition),\n",
    "and is especially convenient for expressing neural networks, which often have this form:\n",
    "\n",
    "         +-------+   +-------+   +-------+\n",
    "    x -->|   F   |-->|   G   |-->|   H   |--> y\n",
    "         +-------+   +-------+   +-------+\n",
    "\n",
    "Coming back to our model at hand, the `Sequential` expression simply\n",
    "says that our model has this form:\n",
    "\n",
    "         +-----------+   +----------------+   +------------+\n",
    "    x -->| Embedding |-->| Recurrent LSTM |-->| DenseLayer |--> y\n",
    "         +-----------+   +----------------+   +------------+"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 1: Add Batch Normalization\n",
    "\n",
    "We now want to add new layers to the model, specifically batch normalization.\n",
    "\n",
    "Batch normalization is a popular technique for speeding up convergence.\n",
    "It is often used for image-processing setups. But could it work for recurrent models, too?\n",
    "\n",
    "> Note: training with Batch Normalization is currently only supported on GPU.\n",
    "\n",
    "So your task will be to insert batch-normalization layers before and after the recurrent LSTM layer.\n",
    "If you have completed the [hands-on labs on image processing](https://github.com/Microsoft/CNTK/blob/release/2.7/Tutorials/CNTK_201B_CIFAR-10_ImageHandsOn.ipynb),\n",
    "you may remember that the [batch-normalization layer](https://www.cntk.ai/pythondocs/layerref.html#batchnormalization-layernormalization-stabilizer) has this form:\n",
    "```\n",
    "    BatchNormalization()\n",
    "```\n",
    "So please go ahead and modify the configuration and see what happens.\n",
    "\n",
    "If everything went right, you will notice improved convergence speed (`loss` and `metric`)\n",
    "compared to the previous configuration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Your task: Add batch normalization\n",
    "def create_model():\n",
    "    with C.layers.default_options(initial_state=0.1):\n",
    "        return C.layers.Sequential([\n",
    "            C.layers.Embedding(emb_dim),\n",
    "            C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=False),\n",
    "            C.layers.Dense(num_labels)\n",
    "        ])\n",
    "\n",
    "# Enable these when done:\n",
    "#do_train()\n",
    "#do_test()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 2: Add a Lookahead \n",
    "\n",
    "Our recurrent model suffers from a structural deficit:\n",
    "Since the recurrence runs from left to right, the decision for a slot label\n",
    "has no information about upcoming words. The model is a bit lopsided.\n",
    "Your task will be to modify the model such that\n",
    "the input to the recurrence consists not only of the current word, but also of the next one\n",
    "(lookahead).\n",
    "\n",
    "Your solution should be in function-composition style.\n",
    "Hence, you will need to write a Python function that does the following:\n",
    "\n",
    "* takes no input arguments\n",
    "* creates a placeholder (sequence) variable\n",
    "* computes the \"next value\" in this sequence using the `sequence.future_value()` operation and\n",
    "* concatenates the current and the next value into a vector of twice the embedding dimension using `splice()`\n",
    "\n",
    "and then insert this function into `Sequential()`'s list right after the embedding layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Your task: Add lookahead\n",
    "def create_model():\n",
    "    with C.layers.default_options(initial_state=0.1):\n",
    "        return C.layers.Sequential([\n",
    "            C.layers.Embedding(emb_dim),\n",
    "            C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=False),\n",
    "            C.layers.Dense(num_labels)\n",
    "        ])\n",
    "    \n",
    "# Enable these when done:\n",
    "#do_train()\n",
    "#do_test()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 3: Bidirectional Recurrent Model\n",
    "\n",
    "Aha, knowledge of future words help. So instead of a one-word lookahead,\n",
    "why not look ahead until all the way to the end of the sentence, through a backward recurrence?\n",
    "Let us create a bidirectional model!\n",
    "\n",
    "Your task is to implement a new layer that\n",
    "performs both a forward and a backward recursion over the data, and\n",
    "concatenates the output vectors.\n",
    "\n",
    "Note, however, that this differs from the previous task in that\n",
    "the bidirectional layer contains learnable model parameters.\n",
    "In function-composition style,\n",
    "the pattern to implement a layer with model parameters is to write a *factory function*\n",
    "that creates a *function object*.\n",
    "\n",
    "A function object, also known as [*functor*](https://en.wikipedia.org/wiki/Function_object), is an object that is both a function and an object.\n",
    "Which means nothing else that it contains data yet still can be invoked as if it was a function.\n",
    "\n",
    "For example, `Dense(outDim)` is a factory function that returns a function object that contains\n",
    "a weight matrix `W`, a bias `b`, and another function to compute \n",
    "`input @ W + b.` (This is using \n",
    "[Python 3.5 notation for matrix multiplication](https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-465).\n",
    "In Numpy syntax it is `input.dot(W) + b`).\n",
    "E.g. saying `Dense(1024)` will create this function object, which can then be used\n",
    "like any other function, also immediately: `Dense(1024)(x)`. \n",
    "\n",
    "Let's look at an example for further clarity: Let us implement a new layer that combines\n",
    "a linear layer with a subsequent batch normalization. \n",
    "To allow function composition, the layer needs to be realized as a factory function,\n",
    "which could look like this:\n",
    "\n",
    "```python\n",
    "def DenseLayerWithBN(dim):\n",
    "    F = Dense(dim)\n",
    "    G = BatchNormalization()\n",
    "    x = placeholder()\n",
    "    apply_x = G(F(x))\n",
    "    return apply_x\n",
    "```\n",
    "\n",
    "Invoking this factory function will create `F`, `G`, `x`, and `apply_x`. In this example, `F` and `G` are function objects themselves, and `apply_x` is the function to be applied to the data.\n",
    "Thus, e.g. calling `DenseLayerWithBN(1024)` will\n",
    "create an object containing a linear-layer function object called `F`, a batch-normalization function object `G`,\n",
    "and `apply_x` which is the function that implements the actual operation of this layer\n",
    "using `F` and `G`. It will then return `apply_x`. To the outside, `apply_x` looks and behaves\n",
    "like a function. Under the hood, however, `apply_x` retains access to its specific instances of `F` and `G`.\n",
    "\n",
    "Now back to our task at hand. You will now need to create a factory function,\n",
    "very much like the example above.\n",
    "You shall create a factory function\n",
    "that creates two recurrent layer instances (one forward, one backward), and then defines an `apply_x` function\n",
    "which applies both layer instances to the same `x` and concatenate the two results.\n",
    "\n",
    "Alright, give it a try! To know how to realize a backward recursion in CNTK,\n",
    "please take a hint from how the forward recursion is done.\n",
    "Please also do the following:\n",
    "* remove the one-word lookahead you added in the previous task, which we aim to replace; and\n",
    "* make sure each LSTM is using `hidden_dim//2` outputs to keep the total number of model parameters limited."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Your task: Add bidirectional recurrence\n",
    "def create_model():\n",
    "    with C.layers.default_options(initial_state=0.1):  \n",
    "        return C.layers.Sequential([\n",
    "            C.layers.Embedding(emb_dim),\n",
    "            C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=False),\n",
    "            C.layers.Dense(num_labels)\n",
    "        ])\n",
    "\n",
    "# Enable these when done:\n",
    "#do_train()\n",
    "#do_test()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The bidirectional model has 40% less parameters than the lookahead one. However, if you go back and look closely\n",
    "you may find that the lookahead one trained about 30% faster.\n",
    "This is because the lookahead model has both less horizontal dependencies (one instead of two\n",
    "recurrences) and larger matrix products, and can thus achieve higher parallelism."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution 1: Adding Batch Normalization**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training 721479 parameters in 6 parameter tensors.\n",
      "Learning rate per minibatch: 0.020999999999999998\n",
      "Finished Epoch[1 of 10]: [Training] loss = 1.740198 * 18010, metric = 28.02% * 18010 4.223s (4264.7 samples/s);\n",
      "Finished Epoch[2 of 10]: [Training] loss = 0.665177 * 18051, metric = 14.30% * 18051 3.917s (4608.4 samples/s);\n",
      "Finished Epoch[3 of 10]: [Training] loss = 0.526256 * 17941, metric = 11.34% * 17941 3.898s (4602.6 samples/s);\n",
      "Finished Epoch[4 of 10]: [Training] loss = 0.395405 * 18059, metric = 8.22% * 18059 4.061s (4446.9 samples/s);\n",
      "Learning rate per minibatch: 0.010499999999999999\n",
      "Finished Epoch[5 of 10]: [Training] loss = 0.293512 * 17957, metric = 6.20% * 17957 3.996s (4493.7 samples/s);\n",
      "Finished Epoch[6 of 10]: [Training] loss = 0.264932 * 18021, metric = 5.73% * 18021 3.931s (4584.3 samples/s);\n",
      "Finished Epoch[7 of 10]: [Training] loss = 0.217258 * 17980, metric = 4.69% * 17980 3.941s (4562.3 samples/s);\n",
      "Finished Epoch[8 of 10]: [Training] loss = 0.209614 * 18025, metric = 4.55% * 18025 4.105s (4391.0 samples/s);\n",
      "Finished Epoch[9 of 10]: [Training] loss = 0.165851 * 17956, metric = 3.84% * 17956 3.963s (4530.9 samples/s);\n",
      "Finished Epoch[10 of 10]: [Training] loss = 0.157653 * 18039, metric = 3.41% * 18039 4.051s (4453.0 samples/s);\n",
      "Finished Evaluation [1]: Minibatch[1-23]: metric = 0.34% * 10984;\n"
     ]
    }
   ],
   "source": [
    "def create_model():\n",
    "    with C.layers.default_options(initial_state=0.1):\n",
    "        return C.layers.Sequential([\n",
    "            C.layers.Embedding(emb_dim),\n",
    "            #C.layers.BatchNormalization(), #Remove this comment if running on GPU\n",
    "            C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=False),\n",
    "            #C.layers.BatchNormalization(), #Remove this comment if running on GPU\n",
    "            C.layers.Dense(num_labels)\n",
    "        ])\n",
    "\n",
    "do_train()\n",
    "do_test()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution 2: Add a Lookahead**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training 901479 parameters in 6 parameter tensors.\n",
      "Learning rate per minibatch: 0.020999999999999998\n",
      "Finished Epoch[1 of 10]: [Training] loss = 1.618925 * 18010, metric = 26.40% * 18010 4.567s (3943.5 samples/s);\n",
      "Finished Epoch[2 of 10]: [Training] loss = 0.572762 * 18051, metric = 12.46% * 18051 5.560s (3246.6 samples/s);\n",
      "Finished Epoch[3 of 10]: [Training] loss = 0.420728 * 17941, metric = 8.57% * 17941 5.254s (3414.7 samples/s);\n",
      "Finished Epoch[4 of 10]: [Training] loss = 0.297996 * 18059, metric = 6.28% * 18059 5.697s (3169.9 samples/s);\n",
      "Learning rate per minibatch: 0.010499999999999999\n",
      "Finished Epoch[5 of 10]: [Training] loss = 0.224015 * 17957, metric = 4.81% * 17957 5.612s (3199.8 samples/s);\n",
      "Finished Epoch[6 of 10]: [Training] loss = 0.207126 * 18021, metric = 4.61% * 18021 5.487s (3284.3 samples/s);\n",
      "Finished Epoch[7 of 10]: [Training] loss = 0.170268 * 17980, metric = 3.69% * 17980 5.538s (3246.7 samples/s);\n",
      "Finished Epoch[8 of 10]: [Training] loss = 0.164910 * 18025, metric = 3.65% * 18025 5.467s (3297.1 samples/s);\n",
      "Finished Epoch[9 of 10]: [Training] loss = 0.126314 * 17956, metric = 2.92% * 17956 4.340s (4137.3 samples/s);\n",
      "Finished Epoch[10 of 10]: [Training] loss = 0.122896 * 18039, metric = 2.67% * 18039 4.218s (4276.7 samples/s);\n",
      "Finished Evaluation [1]: Minibatch[1-23]: metric = 0.38% * 10984;\n"
     ]
    }
   ],
   "source": [
    "def OneWordLookahead():\n",
    "    x = C.placeholder()\n",
    "    apply_x = C.splice(x, C.sequence.future_value(x))\n",
    "    return apply_x\n",
    "\n",
    "def create_model():\n",
    "    with C.layers.default_options(initial_state=0.1):\n",
    "        return C.layers.Sequential([\n",
    "            C.layers.Embedding(emb_dim),\n",
    "            OneWordLookahead(),\n",
    "            C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=False),\n",
    "            C.layers.Dense(num_labels)        \n",
    "        ])\n",
    "\n",
    "do_train()\n",
    "do_test()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution 3: Bidirectional Recurrent Model**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training 541479 parameters in 9 parameter tensors.\n",
      "Learning rate per minibatch: 0.020999999999999998\n",
      "Finished Epoch[1 of 10]: [Training] loss = 1.886776 * 18010, metric = 30.06% * 18010 7.619s (2363.8 samples/s);\n",
      "Finished Epoch[2 of 10]: [Training] loss = 0.683211 * 18051, metric = 14.83% * 18051 7.325s (2464.3 samples/s);\n",
      "Finished Epoch[3 of 10]: [Training] loss = 0.521379 * 17941, metric = 11.42% * 17941 7.265s (2469.5 samples/s);\n",
      "Finished Epoch[4 of 10]: [Training] loss = 0.394698 * 18059, metric = 8.11% * 18059 7.567s (2386.5 samples/s);\n",
      "Learning rate per minibatch: 0.010499999999999999\n",
      "Finished Epoch[5 of 10]: [Training] loss = 0.288926 * 17957, metric = 6.06% * 17957 7.386s (2431.2 samples/s);\n",
      "Finished Epoch[6 of 10]: [Training] loss = 0.267000 * 18021, metric = 5.73% * 18021 7.401s (2434.9 samples/s);\n",
      "Finished Epoch[7 of 10]: [Training] loss = 0.215379 * 17980, metric = 4.69% * 17980 7.269s (2473.5 samples/s);\n",
      "Finished Epoch[8 of 10]: [Training] loss = 0.206970 * 18025, metric = 4.37% * 18025 7.333s (2458.1 samples/s);\n",
      "Finished Epoch[9 of 10]: [Training] loss = 0.160564 * 17956, metric = 3.39% * 17956 7.196s (2495.3 samples/s);\n",
      "Finished Epoch[10 of 10]: [Training] loss = 0.154584 * 18039, metric = 3.20% * 18039 7.337s (2458.6 samples/s);\n",
      "Finished Evaluation [1]: Minibatch[1-23]: metric = 0.38% * 10984;\n"
     ]
    }
   ],
   "source": [
    "def BiRecurrence(fwd, bwd):\n",
    "    F = C.layers.Recurrence(fwd)\n",
    "    G = C.layers.Recurrence(bwd, go_backwards=True)\n",
    "    x = C.placeholder()\n",
    "    apply_x = C.splice(F(x), G(x))\n",
    "    return apply_x \n",
    "\n",
    "def create_model():\n",
    "    with C.layers.default_options(initial_state=0.1):\n",
    "        return C.layers.Sequential([\n",
    "            C.layers.Embedding(emb_dim),\n",
    "            BiRecurrence(C.layers.LSTM(hidden_dim//2), \n",
    "                                  C.layers.LSTM(hidden_dim//2)),\n",
    "            C.layers.Dense(num_labels)\n",
    "        ])\n",
    "\n",
    "do_train()\n",
    "do_test()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task overview: Sequence classification\n",
    "\n",
    "We will reuse the same data for this task. We revisit a sample again:\n",
    "\n",
    "    19  |S0 178:1 |# BOS      |S1 14:1 |# flight  |S2 128:1 |# O\n",
    "    19  |S0 770:1 |# show                         |S2 128:1 |# O\n",
    "    19  |S0 429:1 |# flights                      |S2 128:1 |# O\n",
    "    19  |S0 444:1 |# from                         |S2 128:1 |# O\n",
    "    19  |S0 272:1 |# burbank                      |S2 48:1  |# B-fromloc.city_name\n",
    "    19  |S0 851:1 |# to                           |S2 128:1 |# O\n",
    "    19  |S0 789:1 |# st.                          |S2 78:1  |# B-toloc.city_name\n",
    "    19  |S0 564:1 |# louis                        |S2 125:1 |# I-toloc.city_name\n",
    "    19  |S0 654:1 |# on                           |S2 128:1 |# O\n",
    "    19  |S0 601:1 |# monday                       |S2 26:1  |# B-depart_date.day_name\n",
    "    19  |S0 179:1 |# EOS                          |S2 128:1 |# O\n",
    "\n",
    "The task of the neural network is to look at the query (column `S0`) and predict the\n",
    "intent of the sequence (column `S1`). We will ignore the slot-tags (column `S2`) this time.\n",
    "\n",
    "\n",
    "### Model Creation\n",
    "\n",
    "The model we will use is a recurrent model consisting of an embedding layer,\n",
    "a recurrent LSTM cell, and a dense layer to compute the posterior probabilities. Though very similar to the slot tagging model in this case we look only at the embedding from the last layer:\n",
    "\n",
    "\n",
    "    intent                                                \"flight\"\n",
    "                                                              ^          \n",
    "                                                              |          \n",
    "                                                          +-------+  \n",
    "                                                          | Dense |   ...\n",
    "                                                          +-------+  \n",
    "                                                              ^         \n",
    "                                                              |          \n",
    "              +------+   +------+   +------+   +------+   +------+   \n",
    "         0 -->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->...\n",
    "              +------+   +------+   +------+   +------+   +------+   \n",
    "                  ^          ^          ^          ^          ^\n",
    "                  |          |          |          |          |\n",
    "              +-------+  +-------+  +-------+  +-------+  +-------+\n",
    "              | Embed |  | Embed |  | Embed |  | Embed |  | Embed |  ...\n",
    "              +-------+  +-------+  +-------+  +-------+  +-------+\n",
    "                  ^          ^          ^          ^          ^\n",
    "                  |          |          |          |          |\n",
    "    w      ------>+--------->+--------->+--------->+--------->+------... \n",
    "                 BOS      \"show\"    \"flights\"    \"from\"   \"burbank\"\n",
    "\n",
    "Or, as a CNTK network description. Please have a quick look and match it with the description above:\n",
    "(descriptions of these functions can be found at: [the layers reference](http://cntk.ai/pythondocs/layerref.html))\n",
    "\n",
    "#### Points to note:\n",
    "- The first difference between this model with the previous one is with regards to the specification of the label `y`. Since there is only one label per sequence, we use `C.input_variable`.\n",
    "- The second difference is the use of [Stabilizer](http://ieeexplore.ieee.org/document/7472719/). We stabilize the embedded output. The stabilizer adds an additional scalar parameter to the learning that can help our network converge more quickly during training. \n",
    "- The third difference is the use of a layer function called `Fold`. As shown in the model above we want the model to have LSTM recurrence except the final one, we set up an LSTM recurrence. The final recurrence will be a Fold operation where we pick the hidden state from the last LSTM block and use it for classification of the entire sequence.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# number of words in vocab, slot labels, and intent labels\n",
    "vocab_size = 943 ; num_intents = 26    \n",
    "\n",
    "# model dimensions\n",
    "emb_dim    = 150\n",
    "hidden_dim = 300\n",
    "\n",
    "# Create the containers for input feature (x) and the label (y)\n",
    "x = C.sequence.input_variable(vocab_size)\n",
    "y = C.input_variable(num_intents)\n",
    "\n",
    "def create_model():\n",
    "    with C.layers.default_options(initial_state=0.1):\n",
    "        return C.layers.Sequential([\n",
    "            C.layers.Embedding(emb_dim, name='embed'),\n",
    "            C.layers.Stabilizer(),\n",
    "            C.layers.Fold(C.layers.LSTM(hidden_dim), go_backwards=False),\n",
    "            C.layers.Dense(num_intents, name='classify')\n",
    "        ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create the `criterion` function with the new model and correspondingly update the placeholder. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Composite(Combine): Input('Input14623', [#, *], [26]), Placeholder('labels', [???], [???]) -> Output('Block14593_Output_0', [#], [1]), Output('Block14613_Output_0', [#], [])"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "criterion = create_criterion_function(create_model())\n",
    "criterion.replace_placeholders({criterion.placeholders[0]: C.sequence.input_variable(num_intents)})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same train code can be used except for the fact that in this case we provide the `intent` tags as the labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training 690477 parameters in 7 parameter tensors.\n",
      "Learning rate per minibatch: 0.020999999999999998\n",
      "Finished Epoch[1 of 5]: [Training] loss = 0.585308 * 18004, metric = 14.42% * 18004 79.421s (226.7 samples/s);\n",
      "Finished Epoch[2 of 5]: [Training] loss = 0.153843 * 17998, metric = 3.92% * 17998 74.394s (241.9 samples/s);\n",
      "Finished Epoch[3 of 5]: [Training] loss = 0.081964 * 18000, metric = 2.17% * 18000 59.060s (304.8 samples/s);\n",
      "Finished Epoch[4 of 5]: [Training] loss = 0.069163 * 18000, metric = 1.92% * 18000 58.621s (307.1 samples/s);\n",
      "Learning rate per minibatch: 0.010499999999999999\n",
      "Finished Epoch[5 of 5]: [Training] loss = 0.018180 * 17998, metric = 0.47% * 17998 58.345s (308.5 samples/s);\n"
     ]
    }
   ],
   "source": [
    "def do_train():\n",
    "    global z\n",
    "    z = create_model()\n",
    "    reader = create_reader(data['train']['file'], is_training=True)\n",
    "    train(reader, z, 5, 'intent')\n",
    "do_train()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can measure the model accuracy by going through all the examples in the test set and using the C.eval.Evaluator method. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished Evaluation [1]: Minibatch[1-23]: metric = 0.00% * 893;\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([ 0.51183742, -0.29180217, -0.41856512, -0.28958377, -0.72160971,\n",
       "       -0.24883319,  0.02916328, -0.0852626 ,  0.19014142, -0.41936243,\n",
       "       -0.23639332, -0.29029393, -0.74258387, -0.12296562,  0.34665295,\n",
       "       -0.46388549, -0.73981428, -0.2296015 , -0.78151304, -0.24418215,\n",
       "       -0.52702737,  0.10101102, -0.27433836, -0.24181543, -0.39551306,\n",
       "        0.30569023], dtype=float32)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def do_test():\n",
    "    reader = create_reader(data['test']['file'], is_training=False)\n",
    "    evaluate(reader, z, 'intent')\n",
    "do_test()\n",
    "z.classify.b.value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following block of code illustrates how to evaluate a single sequence. Additionally, we show how one can pass in the information using NumPy arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "14\n",
      "BOS flights from new york to seattle EOS : flight\n"
     ]
    }
   ],
   "source": [
    "# load dictionaries\n",
    "query_wl = [line.rstrip('\\n') for line in open(data['query']['file'])]\n",
    "intent_wl = [line.rstrip('\\n') for line in open(data['intent']['file'])]\n",
    "query_dict = {query_wl[i]:i for i in range(len(query_wl))}\n",
    "intent_dict = {intent_wl[i]:i for i in range(len(intent_wl))}\n",
    "\n",
    "# let's run a sequence through\n",
    "seq = 'BOS flights from new york to seattle EOS'\n",
    "w = [query_dict[w] for w in seq.split()] # convert to word indices\n",
    "onehot = np.zeros([len(w),len(query_dict)], np.float32)\n",
    "for t in range(len(w)):\n",
    "    onehot[t,w[t]] = 1\n",
    "\n",
    "pred = z(x).eval({x:[onehot]})[0]\n",
    "best = np.argmax(pred)\n",
    "print(best)\n",
    "print(seq, \":\", intent_wl[best])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task 4: Use all hidden states for sequence classification**\n",
    "\n",
    "In the last model, we looked at the output of the last LSTM block. There is another way to model, where we aggregate the output from all the LSTM blocks and use the aggregated output to the final `Dense` layer.\n",
    "\n",
    "So your task will be to replace the `C.layers.Fold` with `C.layers.Recurrence` layer function. This is the explicit way of setting up recurrence. You will aggregate all the intermediate outputs from LSTM blocks using `C.sequence.reduce_sum`. Note: this is different from last model where we looked only at the output of the last LSTM block. \n",
    "\n",
    "So please go ahead and modify the configuration and see what happens.\n",
    "\n",
    "If everything went right, you will notice improved accuracy (`metric`). The solution is presented right after, but we suggest you refrain from looking at the solution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Replace the line with Fold operation\n",
    "def create_model():\n",
    "    with C.layers.default_options(initial_state=0.1):\n",
    "        return C.layers.Sequential([\n",
    "            C.layers.Embedding(emb_dim, name='embed'),\n",
    "            C.layers.Stabilizer(),\n",
    "            C.layers.Fold(C.layers.LSTM(hidden_dim), go_backwards=False),\n",
    "            C.layers.Dense(num_intents, name='classify')\n",
    "        ])\n",
    "# Enable these when done:\n",
    "#do_train()\n",
    "#do_test()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Aggregating all the intermediate states improves the accuracy for the same number of iterations without any significant increase in the computation time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution 4: Use all hidden states for sequence classification**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training 690477 parameters in 7 parameter tensors.\n",
      "Learning rate per minibatch: 0.020999999999999998\n",
      "Finished Epoch[1 of 5]: [Training] loss = 0.193482 * 18004, metric = 4.07% * 18004 56.904s (316.4 samples/s);\n",
      "Finished Epoch[2 of 5]: [Training] loss = 0.028983 * 17998, metric = 0.64% * 17998 73.820s (243.8 samples/s);\n",
      "Finished Epoch[3 of 5]: [Training] loss = 0.059269 * 18000, metric = 1.24% * 18000 71.494s (251.8 samples/s);\n",
      "Finished Epoch[4 of 5]: [Training] loss = 0.005993 * 18000, metric = 0.18% * 18000 61.856s (291.0 samples/s);\n",
      "Learning rate per minibatch: 0.010499999999999999\n",
      "Finished Epoch[5 of 5]: [Training] loss = 0.000344 * 17998, metric = 0.01% * 17998 57.114s (315.1 samples/s);\n",
      "Finished Evaluation [1]: Minibatch[1-23]: metric = 0.00% * 893;\n"
     ]
    }
   ],
   "source": [
    "def create_model():\n",
    "    with C.layers.default_options(initial_state=0.1):\n",
    "        return C.layers.Sequential([\n",
    "            C.layers.Embedding(emb_dim, name='embed'),\n",
    "            C.layers.Stabilizer(),\n",
    "            C.sequence.reduce_sum(C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=False)),\n",
    "            C.layers.Dense(num_intents, name='classify')\n",
    "        ])\n",
    "    \n",
    "do_train()\n",
    "do_test()"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
